Introduction

My data: https://www.kaggle.com/uciml/student-alcohol-consumption

This dataset contains a survey of students in a Math course and a Portuguese language course at two secondary schools in Portugal (Gabriel Pereira and Mousinho da Silveira). The dataframes contains a lot of important and interesting social, gender and study information regarding the students at the schools.

The contributor of the data suggested that the data could be great for running predictive models such as predicting students final grades. Besides the recommendation from the data’s poster, there were a couple other reasons why I choose the dataset. This dataset seemed intresting and easy to work with (including pretty decent metadata), looked like it had enough data with dependent and independent variables to do multiple regression, and contained two csv’s allowing me to perform a join. These factors made me think it would be perfect for this midterm assingment.

I hope to learn about what socioeconimic and social backgrounds can impact students ability to perform in acadamia and/or find factors that lead to alcohol consumption. I also would be intrested in looking at the differences between each school, maybe again looking at grades and alcohol consumption.

Importing the data

Before we get to the fun stuff I have to actaully put my data in R…

setwd("C:/Users/carso/OneDrive/Desktop/MARE 375 Midterm")
library(readr)
student_mat <- read_csv("student-mat.csv")
library(readr)
student_por <- read_csv("student-por.csv")

Changing names for the two dataframes, because I don’t want to type more than needed.

sm <- student_mat
sp <- student_por
sm
## # A tibble: 395 x 33
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4     2 heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T           2     2 other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3     2 serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 385 more rows, and 23 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>
sp
## # A tibble: 649 x 33
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4     2 heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T           2     2 other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3     2 serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 639 more rows, and 23 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>

Content of the dataframes

By using the str() command, I am able to see how many observations and variables are in my dataframes and the type of data :

str(sm)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : num  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : num  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : num  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : num  6 6 10 15 10 15 11 6 19 15 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   failures = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   paid = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double()
##   .. )
str(sp)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : num  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : num  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : num  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : num  11 11 12 14 13 13 13 13 17 13 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   failures = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   paid = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double()
##   .. )

Attributes for both dataframes:

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

age - student’s age (numeric: from 15 to 22)

address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

traveltime - home to school travel time (numeric: 1 - 3 hours)

studytime - weekly study time (numeric: 1 - 10 hours)

failures - number of past class failures (numeric: n if 1<=n<3, else 4)

schoolsup - extra educational support (binary: yes or no)

famsup - family educational support (binary: yes or no)

paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

activities - extra-curricular activities (binary: yes or no)

nursery - attended nursery school (binary: yes or no)

higher - wants to take higher education (binary: yes or no)

internet - Internet access at home (binary: yes or no)

romantic - with a romantic relationship (binary: yes or no)

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

freetime - free time after school (numeric: from 1 - very low to 5 - very high)

goout - going out with friends (numeric: from 1 - very low to 5 - very high)

Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

health - current health status (numeric: from 1 - very bad to 5 - very good)

absences - number of school absences (numeric: from 0 to 93)

Tidy?

Let’s see if the data is Tidy!

Math Course
head(sm)
## # A tibble: 6 x 33
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob  reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr> <chr> 
## 1 GP     F        18 U       GT3     A           4     4 at_h~ teac~ course
## 2 GP     F        17 U       GT3     T           1     1 at_h~ other course
## 3 GP     F        15 U       LE3     T           1     1 at_h~ other other 
## 4 GP     F        15 U       GT3     T           4     2 heal~ serv~ home  
## 5 GP     F        16 U       GT3     T           3     3 other other home  
## 6 GP     M        16 U       LE3     T           4     3 serv~ other reput~
## # ... with 22 more variables: guardian <chr>, traveltime <dbl>,
## #   studytime <dbl>, failures <dbl>, schoolsup <chr>, famsup <chr>,
## #   paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## #   internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## #   goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>,
## #   G1 <dbl>, G2 <dbl>, G3 <dbl>
Portuguese Language Course
head(sp)
## # A tibble: 6 x 33
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob  reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr> <chr> 
## 1 GP     F        18 U       GT3     A           4     4 at_h~ teac~ course
## 2 GP     F        17 U       GT3     T           1     1 at_h~ other course
## 3 GP     F        15 U       LE3     T           1     1 at_h~ other other 
## 4 GP     F        15 U       GT3     T           4     2 heal~ serv~ home  
## 5 GP     F        16 U       GT3     T           3     3 other other home  
## 6 GP     M        16 U       LE3     T           4     3 serv~ other reput~
## # ... with 22 more variables: guardian <chr>, traveltime <dbl>,
## #   studytime <dbl>, failures <dbl>, schoolsup <chr>, famsup <chr>,
## #   paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## #   internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## #   goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>,
## #   G1 <dbl>, G2 <dbl>, G3 <dbl>

In my data, each variable has its own column and each observation hsd its own row.

Sadly, I don’t know if my data has unique keys.

I can’t move forward until my data is 100% tidy, so I will check for unique keys

First

I will want to see if my data has any primary keys

I am going to use many identifiers that I think will help me reach a primary key:

Math Course

# Making sure my packages are loaded
detach(package:readr, unload = TRUE)
library(tidyverse)
## Counting repeated observations to see if I have a primary key

sm %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>%
filter(n >1)
## # A tibble: 4 x 14
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob  reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr> <chr> 
## 1 GP     M        16 U       GT3     T           3     3 serv~ other home  
## 2 GP     M        16 U       GT3     T           4     4 teac~ teac~ course
## 3 GP     M        17 U       GT3     T           2     1 other other home  
## 4 GP     M        18 U       GT3     T           4     4 teac~ serv~ home  
## # ... with 3 more variables: nursery <chr>, internet <chr>, n <int>

Portuguese Language Course

sp %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>%
filter(n >1)
## # A tibble: 11 x 14
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        17 U       GT3     T           4     4 teac~ serv~
##  2 GP     F        18 U       GT3     T           1     1 other other
##  3 GP     M        16 U       GT3     T           3     3 serv~ other
##  4 GP     M        16 U       GT3     T           4     4 serv~ serv~
##  5 GP     M        16 U       GT3     T           4     4 teac~ teac~
##  6 GP     M        17 U       GT3     T           2     1 other other
##  7 GP     M        18 U       GT3     T           4     4 teac~ serv~
##  8 MS     F        16 R       GT3     T           2     2 other other
##  9 MS     F        16 U       GT3     T           1     2 other serv~
## 10 MS     F        18 U       GT3     T           1     1 other other
## 11 MS     M        15 R       LE3     T           4     1 heal~ serv~
## # ... with 4 more variables: reason <chr>, nursery <chr>, internet <chr>,
## #   n <int>

It looks like with this set of identifiers, we do not have a primary key

Let me add in the grades to see if I can get a primary key

Math Course

sm %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet, G1,G2,G3) %>%
filter(n >1)
## # A tibble: 0 x 17
## # ... with 17 variables: school <chr>, sex <chr>, age <dbl>,
## #   address <chr>, famsize <chr>, Pstatus <chr>, Medu <dbl>, Fedu <dbl>,
## #   Mjob <chr>, Fjob <chr>, reason <chr>, nursery <chr>, internet <chr>,
## #   G1 <dbl>, G2 <dbl>, G3 <dbl>, n <int>

Portuguese Language Course

sp %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet, G1,G2,G3) %>%
filter(n >1)
## # A tibble: 1 x 17
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob  reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr> <chr> 
## 1 MS     F        16 R       GT3     T           2     2 other other course
## # ... with 6 more variables: nursery <chr>, internet <chr>, G1 <dbl>,
## #   G2 <dbl>, G3 <dbl>, n <int>

It looks like I still don’t have a primary key for both schools :(

Time to make a surrogate key

library(dplyr)

#### Adding ID's to my data 

smi <- mutate(sm,id = dplyr::row_number(G1))
spi <- mutate(sp,id = dplyr::row_number(G1))

Made ID’s by grade because it was rather variable

Printing my new data frames
smi
## # A tibble: 395 x 34
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4     2 heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T           2     2 other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3     2 serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 385 more rows, and 24 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>

Primary key check:

 smi %>% count(id) %>% filter(n >1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
spi
## # A tibble: 649 x 34
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4     2 heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T           2     2 other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3     2 serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 639 more rows, and 24 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>

Primary key check:

 spi %>% count(id) %>% filter(n >1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>

Looks like I have made a primary key!

Imputations

Okay time for imputations!

Loading the packages I need!

detach(package:tidyverse, unload = TRUE)
library(mice)

Looking for missing data in my dataframes

Math Course

md.pattern(smi)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 395      1   1   1       1       1       1    1    1    1    1      1
##          0   0   0       0       0       0    0    0    0    0      0
##     guardian traveltime studytime failures schoolsup famsup paid
## 395        1          1         1        1         1      1    1
##            0          0         0        0         0      0    0
##     activities nursery higher internet romantic famrel freetime goout Dalc
## 395          1       1      1        1        1      1        1     1    1
##              0       0      0        0        0      0        0     0    0
##     Walc health absences G1 G2 G3 id  
## 395    1      1        1  1  1  1  1 0
##        0      0        0  0  0  0  0 0

Portuguese Language Course

md.pattern(spi)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 649      1   1   1       1       1       1    1    1    1    1      1
##          0   0   0       0       0       0    0    0    0    0      0
##     guardian traveltime studytime failures schoolsup famsup paid
## 649        1          1         1        1         1      1    1
##            0          0         0        0         0      0    0
##     activities nursery higher internet romantic famrel freetime goout Dalc
## 649          1       1      1        1        1      1        1     1    1
##              0       0      0        0        0      0        0     0    0
##     Walc health absences G1 G2 G3 id  
## 649    1      1        1  1  1  1  1 0
##        0      0        0  0  0  0  0 0

My data doesn’t have any missing values, so I will have to delete some stuff before I make imputations!

Deleting stuff and making NA!

library(naniar)

smi2 <- smi %>% replace_with_na_all(condition = ~.x == 2)
spi2 <- spi %>% replace_with_na_all(condition = ~.x == 2)

Decided to replace all twos with NA cause that seemed fun

smi2 
## # A tibble: 395 x 34
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4    NA heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T          NA    NA other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3    NA serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 385 more rows, and 24 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>
spi2
## # A tibble: 649 x 34
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob 
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr>
##  1 GP     F        18 U       GT3     A           4     4 at_h~ teac~
##  2 GP     F        17 U       GT3     T           1     1 at_h~ other
##  3 GP     F        15 U       LE3     T           1     1 at_h~ other
##  4 GP     F        15 U       GT3     T           4    NA heal~ serv~
##  5 GP     F        16 U       GT3     T           3     3 other other
##  6 GP     M        16 U       LE3     T           4     3 serv~ other
##  7 GP     M        16 U       LE3     T          NA    NA other other
##  8 GP     F        17 U       GT3     A           4     4 other teac~
##  9 GP     M        15 U       LE3     A           3    NA serv~ other
## 10 GP     M        15 U       GT3     T           3     4 other other
## # ... with 639 more rows, and 24 more variables: reason <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## #   famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>

Performing the imputation!

Using PMM (Predictive Mean Matching) for method, because I am doing the imputation on numeric variables.

“smii <- mice(smi2, m = 5, maxit = 50, method = ‘pmm’, seed = 500)”

“spii <- mice(spi2, m = 5, maxit = 50, method = ‘pmm’, seed = 500)”

This isn’t in a coding box because the output is way too long, but I ran it below without showing the output.

Looking at the imputation

Math Course:

smii_imp <- complete(smii,3)

Portuguese Language Course :

spii_imp <- complete(spii,3)
head(smii_imp)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    4   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          1         1        0       yes     no   no
## 2     course   father          1         1        0        no    yes   no
## 3      other   mother          1         1        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         1        0        no    yes  yes
## 6 reputation   mother          1         1        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     4    1
## 4        yes     yes    yes      yes      yes      3        4     4    1
## 5         no     yes    yes       no       no      4        3     5    1
## 6        yes     yes    yes      yes       no      5        4     3    1
##   Walc health absences G1 G2 G3  id
## 1    1      3        6  5  6  6   3
## 2    1      3        4  5  5  6   4
## 3    3      3       10  7  8 10  34
## 4    1      5       12 15 14 15 331
## 5    4      5        4  6 10 10  10
## 6    1      5       10 15 15 15 332
head(spii_imp)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    4   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          1         1        0       yes     no   no
## 2     course   father          1         1        0        no    yes   no
## 3      other   mother          1         1        0       yes     no   no
## 4       home   mother          1         3        0        no    yes   no
## 5       home   father          1         3        0        no    yes   no
## 6 reputation   mother          1         3        0        no    yes   no
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     3    1
## 4        yes     yes    yes      yes      yes      3        4     3    1
## 5         no     yes    yes       no       no      4        3     3    1
## 6        yes     yes    yes      yes       no      5        4     3    1
##   Walc health absences G1 G2 G3  id
## 1    1      3        4  0 11 11   1
## 2    1      3        0  9 11 11  93
## 3    3      3        6 12 13 12 344
## 4    1      5        0 14 14 14 498
## 5    1      5        0 11 13 13 253
## 6    1      5        6 12 12 13 345

It seems like the imputation did a decent job. It replaced all of the NA values with one, which is close to two, but not quite!

Joining Datasets

Because my two dataframes both deal with students in either a Math course or a Portuguese language course at two secondary schools in Portugal (Gabriel Pereira and Mousinho da Silveira), I can combine the data by merging the students who are in both classes together.

The metadata of the datset I am using said that there is an overlap of 382 students that are in both the math class and the potuguese language class. They gave a suggestion on what to merge to capture the 382 students. Below is the code I used to merge both classes into one master data frame.

It was suggested to merge the two data sets by the following variables:

“school”,“sex”,“age”,“address”,“famsize”,“Pstatus”,“Medu”,“Fedu”,“Mjob”,“Fjob”,“reason”,“nursery”,“internet”

Here is the merge they suggested

All = merge(smi,spi,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))
print(nrow(All)) # 382 students
## [1] 382

I am intrested in seeing if this combined data set actually contains 382 unique students who attend both schools, so I will do some more counts

All %>% count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>% filter(n >1)
## # A tibble: 8 x 14
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob  Fjob  reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr> <chr> <chr> 
## 1 GP     F        17 U       GT3     T           4     4 teac~ serv~ course
## 2 GP     F        18 U       GT3     T           1     1 other other home  
## 3 GP     M        16 U       GT3     T           3     3 serv~ other home  
## 4 GP     M        16 U       GT3     T           4     4 serv~ serv~ course
## 5 GP     M        16 U       GT3     T           4     4 teac~ teac~ course
## 6 GP     M        17 U       GT3     T           2     1 other other home  
## 7 GP     M        18 U       GT3     T           4     4 teac~ serv~ home  
## 8 MS     F        18 U       GT3     T           1     1 other other course
## # ... with 3 more variables: nursery <chr>, internet <chr>, n <int>

Seems like this isn’t truly unique counts of students. I guess I will take a crack at it!

I think I will merge the data with all variables to make sure that we can see what is going on I did not include G1,G2,G3 or paid variables. This is because the paid variable shouldn’t effect our merge and the grades need to be seperate for each class.

Merge by default uses an inner join to join the data, which is what I need for this join

All2 <- merge(smi,spi,by=c("school","sex","age","address","famsize","Pstatus",
                            "Medu","Fedu","Mjob","Fjob","reason","nursery","internet",
                            "guardian","traveltime","studytime","failures",
                            "schoolsup","famsup","activities","higher","romantic",
                            "famrel","freetime","goout","Dalc","Walc","health","absences"))
print(nrow(All2)) # 85 students
## [1] 85

Time to check if we have repeats!

All2 %>% count(school,sex,age,address,famsize,Pstatus,
                            Medu,Fedu,Mjob,Fjob,reason,nursery,internet, guardian,traveltime,studytime,failures,
                            schoolsup,famsup,activities,higher,romantic,
                            famrel,freetime,goout,Dalc,Walc,health,absences) %>% filter(n >1)
## # A tibble: 0 x 30
## # ... with 30 variables: school <chr>, sex <chr>, age <dbl>,
## #   address <chr>, famsize <chr>, Pstatus <chr>, Medu <dbl>, Fedu <dbl>,
## #   Mjob <chr>, Fjob <chr>, reason <chr>, nursery <chr>, internet <chr>,
## #   guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## #   schoolsup <chr>, famsup <chr>, activities <chr>, higher <chr>,
## #   romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## #   Walc <dbl>, health <dbl>, absences <dbl>, n <int>

Looks like we do not have any repeated rows!

Working with a Master Dataframe

Now that I have joined my data, I want to make some dataframes that I can use for plots!

Master <- All2

I should rename some of my rows so my master dataframe is easier to interpret!

library(plyr)
 Master <- rename(Master, c("paid.x" = "paid.m", "G1.x" = "G1.m",  "G2.x" = "G2.m",  "G3.x" = "G3.m", "id.x" = "id.m", "paid.y" = "paid.p", "G1.y" = "G1.p",  "G2.y" = "G2.p", "G3.y" = "G3.p", "id.y" = "id.p"))

Renamed the x’s to m for Math course and the y’s to p for the Portuguese language class.

head(Master)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  15       R     GT3       T    2    2  at_home    other
## 2     GP   F  15       R     GT3       T    2    4 services   health
## 3     GP   F  15       R     GT3       T    3    4 services   health
## 4     GP   F  15       U     GT3       A    3    3    other   health
## 5     GP   F  15       U     GT3       A    4    3 services services
## 6     GP   F  15       U     GT3       T    1    1    other    other
##       reason nursery internet guardian traveltime studytime failures
## 1 reputation     yes       no   mother          1         1        0
## 2     course     yes      yes   mother          1         3        0
## 3     course     yes      yes   mother          1         3        0
## 4 reputation     yes       no   father          1         4        0
## 5 reputation     yes      yes   mother          1         2        0
## 6       home      no      yes   father          1         2        0
##   schoolsup famsup activities higher romantic famrel freetime goout Dalc
## 1       yes    yes        yes    yes       no      4        3     1    1
## 2       yes    yes        yes    yes       no      4        3     2    1
## 3       yes    yes        yes    yes       no      4        3     2    1
## 4       yes     no         no    yes       no      4        3     3    1
## 5        no    yes        yes    yes       no      4        3     2    1
## 6        no    yes        yes    yes       no      4        3     2    2
##   Walc health absences paid.m G1.m G2.m G3.m id.m paid.p G1.p G2.p G3.p
## 1    1      2        8    yes   14   13   13  305     no   14   13   12
## 2    1      5        2    yes   10    9    8  146     no   10   11   10
## 3    1      5        2    yes   12   12   11  237     no   11   12   12
## 4    1      4       10     no   10   11   11  157     no   10   10   10
## 5    1      1        0    yes   14   15   15  306     no   15   14   15
## 6    3      4        2     no    9   10   10  115     no   13   12   12
##   id.p
## 1  502
## 2  161
## 3  258
## 4  171
## 5  571
## 6  443

This datframe looks good, but I also want to make a data frame without any categorical variables!

I want to keep all of the possible predictor variables so that I can compare and contrast them with each other, but this next dataframe will just have numbers.

Making a dataframe with no categorical variables

Mastercor <- select(Master,-c(school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,nursery,internet,guardian,schoolsup,famsup,activities,higher,romantic,paid.m,paid.p,id.m,id.p,G1.m,G2.m,G1.p,G2.p))

Plotting and Analyzing the Data!!

Time to get an idea of what is going with my dataset!

First I’m going to use some base functions to look at my data

head(Master)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  15       R     GT3       T    2    2  at_home    other
## 2     GP   F  15       R     GT3       T    2    4 services   health
## 3     GP   F  15       R     GT3       T    3    4 services   health
## 4     GP   F  15       U     GT3       A    3    3    other   health
## 5     GP   F  15       U     GT3       A    4    3 services services
## 6     GP   F  15       U     GT3       T    1    1    other    other
##       reason nursery internet guardian traveltime studytime failures
## 1 reputation     yes       no   mother          1         1        0
## 2     course     yes      yes   mother          1         3        0
## 3     course     yes      yes   mother          1         3        0
## 4 reputation     yes       no   father          1         4        0
## 5 reputation     yes      yes   mother          1         2        0
## 6       home      no      yes   father          1         2        0
##   schoolsup famsup activities higher romantic famrel freetime goout Dalc
## 1       yes    yes        yes    yes       no      4        3     1    1
## 2       yes    yes        yes    yes       no      4        3     2    1
## 3       yes    yes        yes    yes       no      4        3     2    1
## 4       yes     no         no    yes       no      4        3     3    1
## 5        no    yes        yes    yes       no      4        3     2    1
## 6        no    yes        yes    yes       no      4        3     2    2
##   Walc health absences paid.m G1.m G2.m G3.m id.m paid.p G1.p G2.p G3.p
## 1    1      2        8    yes   14   13   13  305     no   14   13   12
## 2    1      5        2    yes   10    9    8  146     no   10   11   10
## 3    1      5        2    yes   12   12   11  237     no   11   12   12
## 4    1      4       10     no   10   11   11  157     no   10   10   10
## 5    1      1        0    yes   14   15   15  306     no   15   14   15
## 6    3      4        2     no    9   10   10  115     no   13   12   12
##   id.p
## 1  502
## 2  161
## 3  258
## 4  171
## 5  571
## 6  443
summary(Master)
##     school              sex                 age          address         
##  Length:85          Length:85          Min.   :15.00   Length:85         
##  Class :character   Class :character   1st Qu.:15.00   Class :character  
##  Mode  :character   Mode  :character   Median :16.00   Mode  :character  
##                                        Mean   :16.39                     
##                                        3rd Qu.:17.00                     
##                                        Max.   :19.00                     
##    famsize            Pstatus               Medu          Fedu      
##  Length:85          Length:85          Min.   :0.0   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:2.0   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.0   Median :3.000  
##                                        Mean   :2.8   Mean   :2.647  
##                                        3rd Qu.:4.0   3rd Qu.:4.000  
##                                        Max.   :4.0   Max.   :4.000  
##      Mjob               Fjob              reason         
##  Length:85          Length:85          Length:85         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    nursery            internet           guardian           traveltime   
##  Length:85          Length:85          Length:85          Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.000  
##                                                           Mean   :1.353  
##                                                           3rd Qu.:2.000  
##                                                           Max.   :3.000  
##    studytime        failures        schoolsup            famsup         
##  Min.   :1.000   Min.   :0.00000   Length:85          Length:85         
##  1st Qu.:2.000   1st Qu.:0.00000   Class :character   Class :character  
##  Median :2.000   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   :2.106   Mean   :0.09412                                        
##  3rd Qu.:3.000   3rd Qu.:0.00000                                        
##  Max.   :4.000   Max.   :3.00000                                        
##   activities           higher            romantic             famrel     
##  Length:85          Length:85          Length:85          Min.   :2.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :4.176  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.212   Mean   :2.812   Mean   :1.282   Mean   :1.976  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:1.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :3.000   Max.   :5.000  
##      health         absences         paid.m               G1.m      
##  Min.   :1.000   Min.   : 0.000   Length:85          Min.   : 6.00  
##  1st Qu.:3.000   1st Qu.: 0.000   Class :character   1st Qu.: 9.00  
##  Median :4.000   Median : 0.000   Mode  :character   Median :12.00  
##  Mean   :3.506   Mean   : 1.894                      Mean   :12.08  
##  3rd Qu.:5.000   3rd Qu.: 2.000                      3rd Qu.:14.00  
##  Max.   :5.000   Max.   :14.000                      Max.   :19.00  
##       G2.m            G3.m         id.m          paid.p         
##  Min.   : 5.00   Min.   : 0   Min.   : 16.0   Length:85         
##  1st Qu.:10.00   1st Qu.:10   1st Qu.:131.0   Class :character  
##  Median :12.00   Median :12   Median :256.0   Mode  :character  
##  Mean   :12.24   Mean   :12   Mean   :236.2                     
##  3rd Qu.:15.00   3rd Qu.:15   3rd Qu.:329.0                     
##  Max.   :19.00   Max.   :19   Max.   :395.0                     
##       G1.p            G2.p            G3.p            id.p      
##  Min.   : 5.00   Min.   : 8.00   Min.   : 0.00   Min.   :  8.0  
##  1st Qu.:11.00   1st Qu.:11.00   1st Qu.:11.00   1st Qu.:278.0  
##  Median :13.00   Median :12.00   Median :13.00   Median :434.0  
##  Mean   :12.75   Mean   :12.82   Mean   :13.21   Mean   :406.7  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:15.00   3rd Qu.:549.0  
##  Max.   :19.00   Max.   :18.00   Max.   :19.00   Max.   :649.0

Base Plot to look at my data

plot(Mastercor)

Hard to tell exactly what is going on, I think I will explore further by creating a correlation matrix of some variables I am intrested in so that I can start thinking about how I want to run a statistical model on this data.

Correlation

Going to do some correlation in hopes of understanding my data

library(dplyr)

#Removing variables that I don't need for my correlation analysis!
Mastercor <- select(Master,-c(school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,nursery,internet,guardian,schoolsup,famsup,activities,higher,romantic,paid.m,paid.p,id.m,id.p,G1.m,G2.m,G1.p,G2.p))

# Running a correlation
Mastercorcor <- cor(Mastercor)

# Need this package to plot!
library(corrplot)
## corrplot 0.84 loaded
# Plotting my findings!
corrplot(Mastercorcor)

By looking at this correlation matrix, I can see some strong negative and positive correlations. Listed below are some that find intresting/and or want to analyze:

traveltime - age - moderate positive

Medu - Fedu - moderate positive

failures - Walc - moderate positive

failures - absences

Dalc - Walc - Strong postive

Failures - Medu and Fedu - strong negative

studytime - Walc and Dalc - moderate negative

goout - Walc - moderate positive

age - G3.m - moderate negative

study time - G3.p - moderate positive

failures - G3.p - moderate negative

health - G3.p - moderate negative

More Plots

Time to use some plots to determine the best statistical approach

Histograms of some important variables:

hist(Master$G3.m, border="black", 
     col="orange", main = "Histogram of Math Course Grades", xlab = "Grades for Math Course") 
  
abline(v=mean(Master$G3.m),col="blue")

abline(v=median(Master$G3.m),col="red")

hist(Master$G3.p, border="black", 
     col="purple", main = "Histogram of Portuguese Language Course Grades", xlab = "Grades for Portuguese Language Course") 
  
abline(v=mean(Master$G3.p),col="blue")

abline(v=median(Master$G3.p),col="red")

hist(Master$studytime, border="black", 
     col="yellow", main = "Histogram of Studytime", xlab = "Study Time (Hours)") 
  
abline(v=mean(Master$studytime),col="blue")

abline(v=median(Master$studytime),col="red")

hist(Master$failures, border="black", 
     col="red", main = "Histogram of Failures in Previous classes", xlab = "Failures") 
  
abline(v=mean(Master$failures),col="blue")

abline(v=median(Master$failures),col="green")

hist(Master$age, border="black", 
     col="green", main = "Histogram of Age", xlab = "Age") 
  
abline(v=mean(Master$age),col="blue")

abline(v=median(Master$age),col="red")

hist(Master$goout, border="black", 
     col="brown", main = "Histogram of Going out", xlab = "Going out") 
  
abline(v=mean(Master$goout),col="blue")

abline(v=median(Master$goout),col="red")

hist(Master$Fedu, border="black", 
     col="blue", main = "Histogram of Father's Education", xlab = "Father's Education") 
  
abline(v=mean(Master$Fedu),col="green")

abline(v=median(Master$Fedu),col="red")

hist(Master$Medu, border="black", 
     col="light blue", main = "Histogram of Mother's Education", xlab = "Mother's Education") 
  
abline(v=mean(Master$Medu),col="green")

abline(v=median(Master$Medu),col="red")

Statistical Modeling

Now that I have a good idea of what variables are significantly correlated and what the spread of my variables look like, I think I can start trying to run a statitical model!

Because the questions I want to answer deal with predicting grades and alcohol consumption using some of the variables in the dataset, regression seems like the best method to analyze my data. I can run multiple regression for my numeric variables and maybe even run a logestic regression for some of my categorical variables!

Data!

Changing my Master data frame so that it only has final grades and also removing some other unneeded variables

MasterR <-  select(Master,-c(id.m,id.p,G1.m,G2.m,G1.p,G2.p,paid.p,paid.m))

Running a Multiple Regression Model

I think I will do my first model looking at possibly predicting grades for the portuguese language class using the predictor variables in my dataframe. There are many different models I can run, but predicting grades and alcohol consumption were my main questions I was hoping to look at.

I am using Medu, Fedu and studytime in my model, because when I ran my correlation analysis, these three variables seemed to impact grade levels for the portuguese langauge course the most.

library(tidyverse)
#Fitting a basic model to try and predict grades for the Portuguese language course
fit1 <- lm(G3.p ~ Medu + Fedu + studytime, data = MasterR)

summary(fit1) 
## 
## Call:
## lm(formula = G3.p ~ Medu + Fedu + studytime, data = MasterR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.129  -1.611   0.100   1.332   6.255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6480     1.0425   9.255 2.49e-14 ***
## Medu          0.8653     0.3322   2.604   0.0109 *  
## Fedu         -0.2892     0.3707  -0.780   0.4376    
## studytime     0.9053     0.3482   2.600   0.0111 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.655 on 81 degrees of freedom
## Multiple R-squared:  0.1868, Adjusted R-squared:  0.1566 
## F-statistic:   6.2 on 3 and 81 DF,  p-value: 0.0007604

Interpretation:

First, we can see that the p-value of the F-statistic is 0.0007604, which is significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable of grades in the portuguese language course.

In this first model, with Medu, Fedu, and studytime as predictor variables, the adjusted R2 = 0.1566. This means that, “15% of the variance in the measure the portuguese language course can be predicted mother’s education level, father’s education level and amount of time spent studying.”

Looking at beta coefficients

summary(fit1)$coefficient
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.6480445  1.0424674  9.255009 2.489157e-14
## Medu         0.8652896  0.3322393  2.604416 1.094586e-02
## Fedu        -0.2892115  0.3707151 -0.780145 4.375793e-01
## studytime    0.9053067  0.3481821  2.600096 1.107365e-02

By looking at the model we can see that for predicting grades in the Portuguese language course Fedu (Father’s Education Level) is not significant for our model. We can also see that study time has has a greater impact on grades in the portuguese language course than Medu (Mother’s education level), but only by a little.

Comparing models:

Because the Fedu variable is not significant, I am going to remove it from the model and compare the new model to the old one.

# New model without Fedu
fit2 <- lm(G3.p ~ Medu + studytime, data = MasterR)

summary(fit2) 
## 
## Call:
## lm(formula = G3.p ~ Medu + studytime, data = MasterR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9473  -1.5623   0.1271   1.2078   6.1271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.3323     0.9584   9.737 2.47e-15 ***
## Medu          0.6894     0.2434   2.832  0.00582 ** 
## studytime     0.9256     0.3464   2.672  0.00909 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.649 on 82 degrees of freedom
## Multiple R-squared:  0.1806, Adjusted R-squared:  0.1607 
## F-statistic: 9.039 on 2 and 82 DF,  p-value: 0.0002834

In my model, with Medu and studytime predictor variables, the adjusted R2 = 0.1607. This means that “16% of the variance in the measure of grades in the portuguese language course can be predicted mother’s education level and amount of time spent studying.”

After analyzing my two models, I now want to compare my two models to see which is better at predicting grades for the portuguese language course

Comparing Model 1 and Model 2

anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: G3.p ~ Medu + Fedu + studytime
## Model 2: G3.p ~ Medu + studytime
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     81 571.05                           
## 2     82 575.34 -1   -4.2908 0.6086 0.4376

If the p-value is not sufficiently low (usually greater than 0.05), we should favor the simpler model. We can see from the table, that we have a Df of -1 (indicating that the less complex model has one less parameter). It also shows a p-value of (0.4376). Meaning that removing the Fedu Variable from the model did lead to a significantly improved fit over the first model.(Favor simpler model)

Plot to compare estimates with

library(jtools)

library(ggstance)
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
plot_summs(fit1, scale = TRUE, fit2, plot.distributions = TRUE)

Checking all residuals to ensure there are no distinct patterns

 par(mfrow=c(2,2))
 plot(fit2)

I don’t see any distinct patterns, so I will move on to plotting my model

Plotting model for predicting grades for the portuguese language course

Just data points for each predictor variable plots

library(ggplot2)

library(jtools)

library(ggstance)

effect_plot(fit2, pred = studytime, interval = TRUE, plot.points = TRUE)

effect_plot(fit2, pred = Medu, interval = TRUE, plot.points = TRUE)

Now looking at estimates with distributions

plot_summs(fit2, scale = TRUE)

plot_summs(fit2, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)

I wanna do a 3D plot to get the best view of my model!

Making the two predictors and the response into specific objects:

var_1 <- Master$studytime

var_2 <- Master$Medu

var_3 <- Master$G3.p

Plot time!

library(plotly)
## 
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- plot_ly(Master, 
             x = var_1, 
             y = var_2, 
             z = var_3) %>%
  
  add_markers(color = ~G3.p) %>%
 
  layout(scene = list(xaxis = list(title = 'Study Time'),
                      yaxis = list(title = 'Mothers Education'),
                      zaxis = list(title = 'Final Grades for Portuguese Language Course ')))

3D Model!

p

Well…. I did some regression, but I also wanna look at what predictor variables may predict Weekend Alcohol Consumption… so I guess I’ll do some more regressions…

I will use failures, going out, and weekday alcohl in my model, because when I ran my correlation matrix and plotted my variables, these three seemed to have the most impact on weekend alcohol consumption.

fit3 <- lm(Walc ~ failures + goout + Dalc, data = MasterR)

summary(fit3) 
## 
## Call:
## lm(formula = Walc ~ failures + goout + Dalc, data = MasterR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0133 -0.6250 -0.2592  0.4526  2.7986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4501     0.3356  -1.341  0.18366    
## failures      0.6790     0.2441   2.782  0.00673 ** 
## goout         0.2882     0.0888   3.245  0.00171 ** 
## Dalc          1.2106     0.1826   6.630 3.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8656 on 81 degrees of freedom
## Multiple R-squared:  0.4941, Adjusted R-squared:  0.4753 
## F-statistic: 26.37 on 3 and 81 DF,  p-value: 5.351e-12

Interpretation:

We can see that the p-value of the F-statistic is 5.351e-12, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable of weekend alcohol consumption.

Also: In this new model, with failures, goingout, and Dalc as predictor variables, the adjusted R2 = 0.4753 . This means that “47% of the variance in the measure of Weekend Alcohol consumption can be predicted by amount of failures in classes, number of times the student go’s out during the week, and by weekday alcohol consumption”

Looking at beta coefficients

summary(fit3)$coefficient
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -0.4500896 0.33563175 -1.341022 1.836617e-01
## failures     0.6790207 0.24411888  2.781516 6.728045e-03
## goout        0.2881807 0.08879627  3.245414 1.705979e-03
## Dalc         1.2105529 0.18258589  6.630046 3.457956e-09

By looking at the model we can see that for predicting weekend alcohol consumption, all variables are significant for our model. We can also see that Dalc (Weekday alcohol consumption) has the a greatest impact on weekend alcohol consumption by a large margin. going out seems to have the least impact on weekend alcohol consumption, so it might be good to run another model without it.

(I Could also add variables to this model, but not many other predictor variables were significantlly correlated)

Comparing models:

Because the goingout variable isn’t as significant in this model as the other variables I am going to run a model without it!

fit4 <- lm(Walc ~ failures +  Dalc, data = MasterR)

summary(fit4) 
## 
## Call:
## lm(formula = Walc ~ failures + Dalc, data = MasterR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8803 -0.5435 -0.5435  0.4565  3.4565 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2799     0.2632   1.063  0.29073    
## failures      0.8096     0.2544   3.182  0.00206 ** 
## Dalc          1.2636     0.1921   6.577 4.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9145 on 82 degrees of freedom
## Multiple R-squared:  0.4283, Adjusted R-squared:  0.4143 
## F-statistic: 30.71 on 2 and 82 DF,  p-value: 1.109e-10

In this model, with failures and Dalc as predictor variables, the adjusted R2 = 0.4143. This means that “41% of the variance in the measure of Weekend Alcohol consumption can be predicted by amount of failures in classes and by weekday alcohol consumption”

This is a smaller R-squared value compared to my first model, so I will compare them with anova!

Comparing Model 3 and Model 4

anova(fit3, fit4)
## Analysis of Variance Table
## 
## Model 1: Walc ~ failures + goout + Dalc
## Model 2: Walc ~ failures + Dalc
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     81 60.689                                
## 2     82 68.581 -1   -7.8917 10.533 0.001706 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting p-value is sufficiently low (less than 0.05), we conclude that the more complex model is significantly better than the simpler model, and thus favor the more complex model.

We can see from the table, that we have a Df of -1 (indicating that the less complex model has one less parameter). It also shows a p-value of (0.001706). Meaning that removing the goingout Variable from the model did not lead to a significantly improved fit over the first model and we should favor the more complex model.

Plotting my comparison

library(jtools)

library(ggstance)

plot_summs(fit3, scale = TRUE, fit4, plot.distributions = TRUE)

Checking all residuals to ensure there are no distinct patterns

 par(mfrow=c(2,2))
 plot(fit3)

I don’t see any distinct patterns, so I will move on to plotting my model

Plotting model for predicting Weekend Alcohol consumption

Just data points for each predictor variable plots

library(ggplot2)

library(jtools)

library(ggstance)

effect_plot(fit3, pred = Dalc, interval = TRUE, plot.points = TRUE)

effect_plot(fit3, pred = goout, interval = TRUE, plot.points = TRUE)

effect_plot(fit3, pred = failures, interval = TRUE, plot.points = TRUE)

Now looking at estimates with distributions

plot_summs(fit3, scale = TRUE)

plot_summs(fit3, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)

This plot is great and I’d love to do a 3D model, but sadly, I am unsure how to plot this because it has 3 predictor variables, but I will still run a 3D model for my other model without the “goingout” Variable.

Last 3D Model (Without goingout variable)

Making the two predictors and the response into specific objects:

var_4 <- Master$Dalc

var_6 <- Master$failures
  
var_7 <- Master$Walc

Plot time!

library(plotly)

a <- plot_ly(Master, 
             x = var_4, 
             y = var_6, 
             z = var_7) %>%
  
  add_markers(color = ~Walc) %>%
 
  layout(scene = list(xaxis = list(title = 'Weekday Alcohol Consumption'),
                      yaxis = list(title = 'Class Failures'),
                      zaxis = list(title = 'Weekend Alcohol Consumption')))

3D Model of model 4!

a

Conclusion for Multiple Regression:

For Grades in Portuguese language course

For a 1 Unit increase in Mother’s education level, with all other predictors held constant, we would expect to see a 0.6894 increase in Grades for the Portuguese language course

And for a 1 Unit increase in study time, with all other predictors held constant, we would expect to see a 0.9256 increase in Grades for the Portuguese language course

For Weekend Alcohol Consumption:

For a 1 Unit increase failures, with all other predictors held constant, we would expect to see a 0.6790207 increase in weekend alcohol consimption

And for a 1 Unit increase in going out, with all other predictors held constant, we would expect to see a 0.2881807 increase in weekend alcohol consimption

And for a 1 Unit increase in Daily Alcohol Consumption, with all other predictors held constant, we would expect to see a 1.2105529 increase in weekend alcohol consimption

Conclusion

This assignment was fun, I learned a lot in terms of how to create a story and move through a statistical analysis.

In the future, it would be cool to run more regression models, because there are many variables to predict from and I didn’t look at grades for the math course.

(Also logestic regression with the categorical variables would be an intresting analysis)